## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##   fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1           7.4             0.70        0.00            1.9     0.076
## 2           7.8             0.88        0.00            2.6     0.098
## 3           7.8             0.76        0.04            2.3     0.092
## 4          11.2             0.28        0.56            1.9     0.075
## 5           7.4             0.70        0.00            1.9     0.076
## 6           7.4             0.66        0.00            1.8     0.075
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  11                   34  0.9978 3.51      0.56     9.4
## 6                  13                   40  0.9978 3.51      0.56     9.4
##   quality
## 1       5
## 2       5
## 3       5
## 4       6
## 5       5
## 6       5
## fixed.acidity   volatile.acidity   citric.acid     residual.sugar
## Min.   : 4.60   Min.   :0.1200     Min.   :0.000   Min.   : 0.900
## 1st Qu.: 7.10   1st Qu.:0.3900     1st Qu.:0.090   1st Qu.: 1.900
## Median : 7.90   Median :0.5200     Median :0.260   Median : 2.200
## Mean   : 8.32   Mean   :0.5278     Mean   :0.271   Mean   : 2.539
## 3rd Qu.: 9.20   3rd Qu.:0.6400     3rd Qu.:0.420   3rd Qu.: 2.600
## Max.   :15.90   Max.   :1.5800     Max.   :1.000   Max.   :15.500
## chlorides         free.sulfur.dioxide   total.sulfur.dioxide
## Min.   :0.01200   Min.   : 1.00         Min.   :  6.00
## 1st Qu.:0.07000   1st Qu.: 7.00         1st Qu.: 22.00
## Median :0.07900   Median :14.00         Median : 38.00
## Mean   :0.08747   Mean   :15.87         Mean   : 46.47
## 3rd Qu.:0.09000   3rd Qu.:21.00         3rd Qu.: 62.00
## Max.   :0.61100   Max.   :72.00         Max.   :289.00
## density          pH              sulphates        alcohol
## Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40
## 1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50
## Median :0.9968   Median :3.310   Median :0.6200   Median :10.20
## Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42
## 3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10
## Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90
## quality
## Min.   :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean   :5.636
## 3rd Qu.:6.000
## Max.   :8.000
I discuss the stats further in the Analysis section below.
From the first chart, you can clearly see most data lies in the 5-6 range for quality, which could make seeing trends difficult, especially since quality only takes integer values.
Some variables are normally distributed (pH, density), while others are skewed right (residual.sugar, chlorides), as expected from the summary statistics and discussed further below.
All of the variables are floats with the exception of quality which is an integer. There are no missing values. We have 1599 wines in the dataset, each with 12 features.
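The column types and the absence of missing values can be checked with two base R calls. This is a minimal sketch, assuming a data frame called `wine_head` (a hypothetical name) that holds a few columns of the six rows printed by `head()` above; the same calls would apply unchanged to the full 1599-row data frame.

```r
# A few columns of the first six rows, copied from the head() output above.
# The name `wine_head` is illustrative only.
wine_head <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66),
  citric.acid      = c(0.00, 0.00, 0.04, 0.56, 0.00, 0.00),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L)
)

# Class of each column: numeric everywhere except the integer quality score.
col_types <- sapply(wine_head, class)

# Number of missing values per column; all zeros means no gaps.
na_counts <- colSums(is.na(wine_head))

print(col_types)
print(na_counts)
```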
The quality variable is probably the most interesting in terms of seeing how it might relate to other features or developing a predictive model.
Any of the other features might have a trend with quality, but I’d guess some of the acid measurements and the sulfur metrics might be important for quality, while something like density is probably not as important since it has such a small range.
No, I did not create any new features.
Several of the variables seem to be related to acidity - fixed.acidity, volatile.acidity, citric.acid, and pH. pH is measured on the typical scale (lower pH is more acidic), while the others are measured in \(g/dm^3\). We can see that fixed acidity dominates in terms of the relative magnitude, with a mean of 8.32, while volatile acidity has a mean of 0.53 and citric acid a mean of 0.27.
Residual sugar has a fairly large range, with a min of 0.9 and a max of 15.5 (in \(g/dm^3\)), but 75% of the values are below 2.6, so the distribution is skewed right. Chlorides and sulphates are similarly skewed. Chloride concentrations are much smaller than the acidity measurements (mean of about 0.09), and the sulfur dioxide values are not directly comparable because they are reported in different units (\(mg/dm^3\)). Density has barely any variability (~1%).
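The skew and variability figures quoted here follow from simple arithmetic on the `summary()` output. The sketch below only re-uses numbers printed above; the variable names are illustrative.

```r
# Values copied from the summary() output above.
density_min  <- 0.9901
density_max  <- 1.0037
density_mean <- 0.9967

# Full range of density relative to its mean, as a percentage.
density_rel_range <- (density_max - density_min) / density_mean * 100

# Residual sugar: a mean above the median and a long upper tail
# are both signs of right skew.
sugar <- c(min = 0.9, q1 = 1.9, median = 2.2, mean = 2.539, q3 = 2.6, max = 15.5)
upper_tail <- sugar[["max"]] - sugar[["q3"]]  # 3rd quartile to maximum
lower_tail <- sugar[["q1"]] - sugar[["min"]]  # minimum to 1st quartile

print(round(density_rel_range, 2))
print(c(upper_tail = upper_tail, lower_tail = lower_tail))
```

The upper tail of residual sugar is many times longer than the lower tail, while the whole range of density is only a little over 1% of its mean.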
The alcohol content (abv) varies from a minimum of 8.4 to a maximum of 14.9, with the median at 10.2. This seems a little low for red wines, but I suppose this particular type (Vinho Verde) may tend to be lower in alcohol.
Finally, quality ratings are on a 1-10 scale, but we only have measurements in the 3-8 range and at least 50% of the wines were rated 5 or 6. The mean rating is 5.64.
The pairs plot shows the distributions as seen above, some simple scatterplots and also the correlations between variables. None have a very high correlation with quality, but alcohol has the highest at ~0.48.
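The correlations with quality can also be ranked numerically rather than read off the pairs plot. This is a rough sketch: `cor_with_quality` is a hypothetical helper, and `wine10` holds only the first ten values of three columns as printed by `str()` above, so its output will not reproduce the ~0.48 figure from the full data.

```r
# First ten observations of selected columns, copied from the str() output.
wine10 <- data.frame(
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.80),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Hypothetical helper: Pearson correlation of every other column with the
# target column, sorted by absolute strength (strongest first).
cor_with_quality <- function(df, target = "quality") {
  predictors <- setdiff(names(df), target)
  cors <- sapply(predictors, function(v) cor(df[[v]], df[[target]]))
  cors[order(abs(cors), decreasing = TRUE)]
}

ranked <- cor_with_quality(wine10)
print(round(ranked, 2))
```

Sorting by absolute value keeps strong negative correlations near the top of the list alongside strong positive ones.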
I made a few more histograms below to examine some of the variables more closely and to see some initial bivariate relationships using color.
From the last two charts, it seems like higher quality wines tend to be slightly higher in alcohol and lower in volatile acidity. I didn’t see anything jump out at me from the other plots. I looked into the alcohol a little further.
You can definitely see that wines in the 9-11 abv range get a lot of 5 ratings, while wines in the ~11.5-13 range get a higher percentage of 6’s and even quite a few 7’s. There are no 8’s below ~9.8 and no 3’s above ~11.
I tried some boxplots, splitting up the wines by quality rating, to see if these trends popped out any more.
These do a better job of showing some of the trends mentioned above, and also show that increased sulphate content trends with higher quality as does citric acid level. This makes sense since sulphates help preserve the wine, and citric acid in small amounts is crisp and refreshing. It should be noted that the quality in the 4-7 range should be more strongly considered since there are relatively few points outside that range.
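The centre line of each box is the group median, so the trends the boxplots show can also be tabulated. The sketch below assumes a small data frame `wine10` (an illustrative name) built from the first ten values printed by `str()`; with so few rows the groups are tiny, so it only demonstrates the mechanics.

```r
# First ten observations of selected columns, copied from the str() output.
wine10 <- data.frame(
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Median of each feature within each quality rating: one row per rating.
medians <- aggregate(cbind(alcohol, volatile.acidity) ~ quality,
                     data = wine10, FUN = median)

# Group sizes, to judge how much weight each rating deserves.
counts <- table(wine10$quality)

print(medians)
print(counts)
```

Printing the group sizes next to the medians makes it easy to see which ratings rest on very few wines.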
Density plots can be an interesting way to show the variation in a feature, and here I split by quality levels to show some of the trends discovered above.
These are interesting because you can see not only where a feature is centered, from the location of the peaks, but also how variable it is within a quality range. For instance, citric acid content in higher quality wines (7-8) is greater and somewhat less variable (at least for the 7’s). Overall these plots are somewhat tough to read, though, and don’t contain much more information than the boxplots.
I thought it would be interesting to take a slightly closer look at the acidity variables and how they are related, so I made some scatter plots.
It is interesting to me that fixed acidity, pH and citric acid have pretty clear relationships with one another, while volatile acidity does not.
The surprising trend I saw was that quality scores tended to increase with alcohol content. The boxplots do a good job showing this and other trends, such as a decrease with volatile acidity or a slight increase with sulphates.
There were some interesting relationships among the acidity variables, for instance that volatile acidity did not really correlate with the other metrics.
Alcohol had the strongest correlation with quality, in the positive direction, and volatile acidity had the strongest in the negative direction.
I know that residual sugar and acidity are often things that must be in balance with alcohol content to make a good wine, so I tried a few scatterplots to see if I could look at multiple variables and see any trends. Residual sugar was skewed, so I log-transformed that variable.
The built-in color gradient was tough for me to see, so I switched to a rainbow color scale. There is lots of overplotting since quality only takes integer values, so I am using geom_jitter and adding some transparency to the points.
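The combination described here (log-transformed sugar, jittered points, transparency, rainbow gradient) might look roughly like the sketch below. It assumes the ggplot2 package is installed; the data frame `wine10` and the specific jitter height, alpha value, and number of rainbow colours are illustrative choices, not the settings used for the original figures.

```r
library(ggplot2)

# First ten observations of selected columns, copied from the str() output.
wine10 <- data.frame(
  residual.sugar = c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1),
  alcohol        = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5),
  quality        = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

p <- ggplot(wine10, aes(x = log10(residual.sugar), y = quality, colour = alcohol)) +
  # Jitter only vertically: quality is an integer, so points stack on rows.
  geom_jitter(width = 0, height = 0.2, alpha = 0.5) +
  # Replace the default single-hue gradient with a rainbow palette.
  scale_colour_gradientn(colours = rainbow(5))

# print(p) would draw the plot on the active graphics device.
```

Restricting the jitter to the vertical direction spreads out the stacked integer ratings without distorting the continuous x values.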
The fixed and volatile acidity plots may have had hints of trends (slight uptick in quality with fixed acidity but downturn with volatile acidity), but the coloration by alcohol did not seem too informative. I thought it might be easier to see patterns if I colored by quality since that is a discrete variable.
These plots are interesting because there does seem to be a trend for the higher quality wines to occupy the top right space of the graphs while the lower quality trend toward the bottom left areas (note that I had to invert volatile acidity and pH directions because higher pH is less acidic and volatile acidity actually follows the opposite trend).
Faceting these plots by quality is another way to look at the data, so I tried that below.
The trend that higher fixed acidity and alcohol correlate with higher quality can be pretty clearly seen by the plot above. The cluster of points moves up and right as the quality score increases.
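Colouring by quality, flipping an axis, and faceting can each be expressed as one extra ggplot2 layer. This is a sketch under the assumption that ggplot2 is available; `wine10` is an illustrative ten-row sample taken from the `str()` output, and the aesthetics are guesses rather than the original plot code.

```r
library(ggplot2)

# First ten observations of selected columns, copied from the str() output.
wine10 <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  pH            = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3.36, 3.35),
  alcohol       = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5),
  quality       = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Colour by quality as a discrete factor; reverse pH so that acidity
# increases from left to right.
ph_plot <- ggplot(wine10, aes(x = pH, y = alcohol, colour = factor(quality))) +
  geom_point(alpha = 0.6) +
  scale_x_reverse() +
  labs(colour = "quality")

# One panel per quality score, sharing axes so clusters can be compared.
faceted_plot <- ggplot(wine10, aes(x = fixed.acidity, y = alcohol)) +
  geom_point(alpha = 0.6) +
  facet_wrap(~ quality)
```

By default `facet_wrap` keeps the same axis limits in every panel, which is what makes a shift of the point cluster between panels visible.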
This plot is sort of interesting. There is maybe a little too much going on, but it is showing that for higher quality wines, volatile acidity tends to be low (more orange/red points) and the citric acid and alcohol contents are high.
I also wanted to take a closer look at residual sugar, rather than focusing only on acidity and abv. Below are some scatterplots looking at sugar and its relation to some other variables like acidity, alcohol, chlorides (salt), and quality.
I don’t get very much information from these plots about any relation between residual sugar and quality. The boxplot shows that for all the quality levels the variation of residual sugar is pretty comparable and there is no clear trend. The last graph does have some color separation, showing that for a given sugar level, higher salt and lower alcohol may be correlated. But this is not very convincing and I’m not really sure what that would mean anyway.
The scatterplots showed that most higher quality wines were high in alcohol, high in acidity, or had medium levels of both. This illustrates that it is not just one aspect, but several, that contribute to a quality wine.
I was surprised that I couldn’t find any relation between sugar and quality or even between sugar and acidity really. I would think that it is very important in wine making to keep the sweetness and acid levels in balance.
I did not create any models.
As the boxplot above illustrates, there is a clear downward trend in volatile acidity with higher quality wine. This is underscored in the histogram below, which shows how the 6, 7 and 8 rated wines tend to fall toward the left side of the volatile acidity distribution.
This plot shows how higher quality wines tend to have higher levels of alcohol, acidity (but not volatile acidity, as discussed above), or both. The high quality wines trend toward the upper right, while low quality wines occupy the bottom left. Note that the scale for pH has been flipped so that acidity increases from left to right. This plot illustrates that a wine with neither very much alcohol nor acidity will probably come off tasting flat or bland and not get a high quality score. Acid helps make a wine crisp and refreshing, while alcohol gives the wine some heat and a fuller body, so the two together can make some of the best wines. The majority of the wines that scored an 8 are below 3.5 pH and above 11% abv.
This figure depicts the same trend as Plot 2, in a slightly different manner. Here the plot has been faceted by quality score, which makes it easy to see the cluster of points move up and to the right as quality increases.
This analysis looked at a dataset on Vinho Verde red wines containing almost 1600 wines. The data contained several metrics measuring each wine’s alcohol, acidity, sweetness, and other factors, as well as a quality score on a 1 to 10 scale. For this analysis, I thought it would be most interesting to try to find relationships between the features of the wine and the quality score. The first difficulty with this was that quality took on discrete, integer values only, and the majority were either 5 or 6, so this obscured some of the trends I was looking for. One way I overcame this was to group wines of similar quality together and look at the “average” wine (and range of wines) that received that score, for example with the boxplots or density plots. Another way was to use quality as the variable to color by, since this was the only discrete variable in the dataset (more of an ordered factor really). This showed some clear trends that informed later analysis. Another difficulty was with seeing the trends in multivariate scatterplots. I was able to decipher the relationships better when using graphic tools such as changing the color gradient or faceting by quality.
I think a further investigation of this dataset could yield interesting predictive models that could be used to estimate a wine’s quality score based on its chemical attributes. However, I think such a model would have a hard time being very accurate, because there is a good amount of variability among the data. A finer-resolution quality scale (e.g. ratings from 1-100) might help make a better model and reduce error. Other data besides chemical features, such as where the grapes were from or the year, might be important for quality predictions as well, and it would be interesting to have the price data to see how well price correlates with quality. Still, just from this analysis, it seems that a red Vinho Verde producer should be trying to make a wine on the higher end of alcohol or acidity (or medium levels of both) if they want a good quality score. Other factors, like having enough sulphates to preserve the wine and keeping volatile acidity low, also help in making the highest quality wines.